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Abstract 



Given a sample covariance matrix, we examine the problem of maximizing the variance 
explained by a linear combination of the input variables while constraining the number 
of nonzero coefficients in this combination. This is known as sparse principal component 
analysis and has a wide array of applications in machine learning and engineering. We 
formulate a new semidefinite relaxation to this problem and derive a greedy algorithm 
that computes a full set of good solutions for all target numbers of non zero coefficients, 
with total complexity O(n^), where n is the number of variables. We then use the same 
relaxation to derive sufficient conditions for global optimality of a solution, which can be 
tested in 0{n^) per pattern. We discuss applications in subset selection and sparse recovery 
and show on artificial examples and biological data that our algorithm does provide globally 
optimal solutions in many cases. 

Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso. 
1. Introduction 

Principal component analysis (PCA) is a classic tool for data analysis, visualization or com- 
pression and has a wide range of applications throughout science and engineering. Starting 
from a multivariate data set, PCA finds linear combinations of the variables called prin- 
cipal components, corresponding to orthogonal directions maximizing variance in the data. 
Numerically, a full PCA involves a singular value decomposition of the data matrix. 

One of the key shortcomings of PCA is that the factors are linear combinations of all 
original variables; that is, most of factor coefficients (or loadings) are non-zero. This means 
that while PCA facilitates model interpretation and visualization by concentrating the in- 
formation in a few factors, the factors themselves are still constructed using all variables, 
hence are often hard to interpret. 



In many applications, the coordinate axes involved in the factors have a direct physi- 
cal interpretation. In financial or biological applications, each axis might correspond to a 
specific asset or gene. In problems such as these, it is natural to seek a trade-off between 
the two goals of statistical fidelity (explaining most of the variance in the data) and in- 
terpretability (making sure that the factors involve only a few coordinate axes). Solutions 
that have only a few nonzero coefficients in the principal components are usually easier 
to interpret. Moreover, in some applications, nonzero coefficients have a direct cost {e.g., 
transaction costs in finance) hence there may be a direct trade-off between statistical fidelity 
and practicality. Our aim here is to efficiently derive sparse principal components, i.e, a set 
of sparse vectors that explain a maximum amount of variance. Our belief is that in many 
applications, the decrease in statistical fidelity required to obtain sparse factors is small and 
relatively benign. 

In what follows, we will focus on the problem of finding sparse factors which explain a 
maximum amount of variance, which can be written: 



max z hz 

I|2|I<1 



pCaLrd{z) 



(1) 



in the variable z £ R", where S G S„ is the (symmetric positive semi-definite) sample 
covariance matrix, p is a parameter controlling sparsity, and Card(2:) denotes the cardinal 
(or if) norm) of z, i.e. the number of non zero coefficients of z. 

While PCA is numerically easy, each factor requires computing a leading eigenvec- 
tor, which can be done in 0(?i^), sparse PCA is a hard combinatorial problem. In fact, 
Moghaddam et al. (l2006bl') show th at the subset selection problem for ordinary least squares, 
which is NP-hard (jNataraia can be reduced to a sparse generalized eigenvalue prob- 

lem, of which sparse PCA is a particular intance. Sometimes ad hoc "rotation" techniques 
are used to post-process the results f r om P CA and find interpretable directions underly- 
ing a particular subspace (see I Jolliffd ()1995)). Another simple s olutio n is to threshold the 
loadings with small absolute value to zero (jCadima and Jolliffd . Il995l ) . A more systematic 
approach to the problem arose in rece nt years, with vari ous researc hers proposing noii - 



convex algorithms (e.g., SCoTLASS bvlJolliffe 



or D.C. based methods (jSriperumbudur et al.l . 120071 ) which find modified principal compo- 



et a 



^2Q0i ) , SLRA by Izhang et al.l (120021 ) 



nents with zero loadings. The SPCA algorithm, which is I j ased on the representation of 
PCA as a regre ssion-type o ptimi zation problem ( Zou et al. . 20061 ). allows the application 



of the LASSO dTibshirani Ii99(tI ). a penalization technique based on the ii norm. With 



the exception of simple t hresholding, all the algorith ms above require solving non convex 
problem; Recently also, Id'Asnremont et a1.l derived an £i based semidefinite re- 

laxation for_Jh£_s£axse_^C7\__groW ([1]) with a complexity of 0{n^^/logn) for a given p. 
Finally, Moghaddam et al. ( 2006al ) used greedy search and branch-and-bound methods to 
solve small instances of problem ([T]) exactly and get good solutions for larger ones. Each 
step of this greedy algorithm has complexity O(n^), leading to a total complexity of O(n^) 
for a full set of solutions. 

Our contribution here is twofold. We first derive a greedy algorithm for computing a 
full set of good solutions (one for each target sparsity between I and n) at a total numerical 
cost of O(n^) based on the convexity of the of the largest eigenvalue of a symmetric matrix. 
We then derive tractable sufficient conditions for a vector z to be a global optimum of ([1]). 
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This means in practice that, given a vector z with support /, we can test if z is a globally 
optimal solution to problem ([1]) by performing a few binary search iterations to solve a 
one dimensional convex minimization problem. In fact, we can take any sparsity pattern 
candidate from an y algorithm and tes t its op timality. This paper builds on the earlier 



conference version (jd'Aspremont et all l2007al ). providing new and simpler conditions for 



optimality and describing applications to subset selection and sparse recovery. 

While there is certai n ly a c ase to be made for i\ penalized maximum eigenvalues 

(a la d'Aspremont et al. ( 2007bl ) ), we strictly focus her e on the formulati on. How- 

ever, it was shown recentl y (see Candes and Tao ( 2005 ). Donoho and Tanner ( 20051 ) or 
Meinshausen and Yu ( 20061 ) among others) that there is in fact a deep connection between 
constrained extremal eigenvalues and LASSO type variable selection algorithms. Suf- 
fic ient conditions b a sed o n sparse eigenvalues (also called restricted isometry constants 
in Candes and Taol ( 2005 )) guarantee consistent variable selection (in the LASSO case) 
or sparse recovery (in the decoding problem). The results we derive here produce upper 
bounds on sparse extremal eigenvalues and can thus be used to prove consistency in LASSO 
estimation, prove perfect recovery in sparse recovery problems, or prove that a particular 
solution of the subset selection problem is optimal. Of course, our conditions are only suffi- 
cient, not necessary (which would contradict the NP-Hardness of subset selection) and the 
duality bounds we produce on sparse extremal eigenvalues cannot always be tight, but we 
observe that the duality gap is often small. 

The paper is organized as follows. We begin by formulating the sparse PCA problem in 
Section [2l In Section [3l we write an efficient algorithm for computing a full set of candidate 
solutions to problem ([1]) with total complexity 0(?i'^). In Section S] we then formulate a 
convex relaxation for the sparse PCA problem, which we use in Section [5] to derive tractable 
sufficient conditions for the global optimality of a particular sparsity pattern. In Section [6] 
we detail applications to subset selection, sparse recovery and variable selection. Finally, in 
Section [71 we test the numerical performance of these results. 



Notation 

For a vector z G R, we let ||z||i = l^^il and ||z|| = (X^"=i -Zi^)""^^^) Card(z) is the 

cardinality of z, i.e. the number of nonzero coefficients of z, while the support I of z 
is the set {i : Zi ^ 0} and we use I'^ to denote its complement. For /? G R, we write 
= max{/3, 0} and for X € Sn (the set of symmetric matrix of size nxn) with eigenvalues 
Aj, Tr(X)_|_ = Y17=i ™cix{Aj, 0}. The vector of all ones is written 1, while the identity matrix 
is written I. The diagonal matrix with the vector u on the diagonal is written diag(M). 



2. Sparse PCA 

Let S G S„ be a symmetric matrix. We consider the following sparse PCA problem: 

(/)(/3) = max z'^TiZ — pCard(z) (2) 

||2||<1 

in the variable z G R" where p > is a parameter controlling sparsity. We assume without 
loss of generality that S G S„ is positive semidefinite and that the n variables are ordered 
by decreasing marginal variances, i.e. that Sn > . . . > We also assume that we are 
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given a square root A of the matrix S with S = A^A, where A G and we denote by 

ai, . . . , fln £ R-" the columns of A. Note that the problem and our algorithms are invariant 
by permutations of S and by the choice of square root A. In practice, we are very often 
given the data matrix A instead of the covariance S. 

A problem that is directly related to ([5]) is that of computing a cardinality constrained 
maximum eigenvalue, by solving: 

maximize z'^'Ez 

subject to Card(2;) < k (3) 

in the variable z G R". Of course, this problem and ([2|) are related. By duality, an upper 
bound on the optimal value of ([3]) is given by: 

inf (j)(p) + pk. 

where P is the set of penalty values for which 4'{p) has been computed. This means in 
particular that if a point z is provably optimal for ([2|), it is also globally optimum for ([3]) 
with k = Card(z). 

We now begin by reformulating ([2]) as a relatively simple convex maximization problem. 
Suppose that p > Sn. Since z'^Ylz < \zi[f' and \zi\)'^ < \\z\\'^ Card(z) for 

all z S R"", we always have: 

(/>(/?) = max[[j.||<i z'^Sz — /9 Card(z) 
< (Ell -p)Card(z) 
<0, 

hence the optimal solution to ([2]) when p > Sn is z = 0. From now on, we assume p < Sn 
in which case the inequality ||z|| < 1 is tight. We can represent the sparsity pattern of a 
vector z by a vector u G {0, 1}" and rewrite ([2]) in the equivalent form: 

<j){p) = max Amax(diag(n)Sdiag(ii)) - pl^n 

«e{o,i}" 

= max X^s^x(diag(u)A'^ Adiag(u)) — pl^u 

«G{0,1}" 

= max Amax(^diag(u)A^) - pl^-u, 

«e{o,i}" 

using the fact that diag(n)^ = diag(ti) for all variables u G {0, l}" and that for any matrix 
B, XraUB^B) = Amax(55^). We then have: 

(l){p) = max Amax(-4diag(n)^^) - pl^u 

nG{0,l}" 

= max max x"^Adiag(n)^"^x — /ol"^tt 

|a:||=l ue{0,l}" 

n 

= max max uAiaJ x)"^ — p) . 

\\x\\=l «G{0,1}"'^ 

Hence we finally get, after maximizing in u (and using max^,g|o^i} [iv = /3+): 

71 

(I^{P) = max V((af x)2 - (4) 

11-11=1^ 



4 



which is a nonconvex problem in the variable x G R". We then select variables i such that 
{aJxY — p > 0. Note that if Sjj = afui < p, we must have {afx)"^ < ||ai|p||x|p < p hence 
variable i will never be part of the optimal subset and we can remove it. 



3. Greedy Solutions 

In this section, we focus on finding a good solution to problem ([2]) using greedy methods. We 
first present very simple preprocessing solutions with complexity 0(n log n) and O(n^). We 
then recall a simple greedy algorithm with complexity O(n^). Finally, our first contribution 
in this section is to derive an approximate greedy algorithm that computes a full set of 
(approximate) solutions for problem ([2]), with total complexity 0{n^). 



3.1 Sorting and Thresholding 

The simplest ranking algorithm is to sort the diagonal of the matrix S and rank the variables 
by variance. This works intuitively because the diagonal is a rough proxy for the eigenvalues: 
the Schur-Horn theorem states that the diagonal of a matrix majorizes its eigenvalues 
( Horn and Johnson . 19851 ): sorting costs 0(n log n). Another quick solution is to compute 
the leading eigenvector of S and form a sparse vector by thresholding to zero the coefficients 
whose magnitude is smaller than a certain level. This can be done with cost O(n^). 



3.2 Full greedy solution 

Following Moghaddam et al. ( 2006al ). starting from an initial solution of cardinality one at 
p = Sii, we can update an increasing sequence of index sets Ik C [l,n], scanning all the 
remaining variables to find the index with maximum variance contribution. 



Greedy Search Algorithm. 

• Input: S G R"^*" 

• Algorithm: 

1. Preprocessing: sort variables by decreasing diagonal elements and permute ele- 
ments of S accordingly. Compute the Cholesky decomposition S = A^A. 

2. Initialization: Ji = {1}, xi = ai/||ai||. 

3. Compute ik = argmax^^^^ Amax (l]je/feU{i} "i^j) • 

4. Set Ik+i = /fcUjifc} and compute x^+i as the leading eigenvector of Xlje/fe+i '^j^J- 

5. Set k = k + I. If k < n go back to step 3. 

• Output: sparsity patterns Ik- 

At every step, Ik represents the set of nonzero elements (or sparsity pattern) of the 
current point and we can define Zk as the solution to problem ([2|) given Ik, which is: 

Zk = argmax z'^Y^z — pk, 

{Zjc=0, \\z\\=l} 
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which means that Zk is formed by padding zeros to the leading eigenvector of the submatrix 
Tij^j^. Note that the entire algorithm can be written in terms of a factorization S = 
A^A of the matrix S, which means significant computational savings when S is given as 
a Gram matrix. The matrices and X^jg/^ o,iaf have the same eigenvalues and their 

eigenvectors are transformed of each other through the matrix A, i.e., if z is an eigenvector 
of then is an eigenvector of Aj^A^^. 

3.3 Approximate greedy solution 

Computing n — k eigenvalues at each iteration is costly and we can use the fact that uu ^ 



is a s ubgradient of Amax at X if ii is a leading eigenvector of X (jBovd and Vandenberghe 



20041 ^. to get: 




> Amax ^ ajOj + (xlaif, (5) 
J 

which means that the variance is increasing by at least (x^aj)^ when variable i is added 
to Ik- This provides a lower bound on the objective which does not require finding n — k 
eigenvalues at each iteration. We then derive the following algorithm: 

Approximate Greedy Search Algorithm. 

• Input: S G R'"^'' 

• Algorithm: 

1. Preprocessing. Sort variables by decreasing diagonal elements and permute ele- 
ments of S accordingly. Compute the Cholesky decomposition S = A^A. 

2. Initialization: Ii = {!}, xi = ai/||ai||. 

3. Compute ifc = argmaxj^/^(x^aj)^ 

4. Set Ik+i = /fcUjifc} and compute x^+i as the leading eigenvector of ^j^J- 

5. Set k = k + 1. If k < n go back to step 3. 

• Output: sparsity patterns Ik- 



Again, at every step, Ik represents the set of nonzero elements (or sparsity pattern) of 
the current point and we can define Zk as the solution to problem ([2]) given Ik, which is: 

Zk = argmax z'^'Sz — pk, 

{zic=0, \\z\\=l} 

which means that Zk is formed by padding zeros to the leading eigenvector of the submatrix 
Better points can be found by testing the variables corresponding to the p largest 
values of {xj^aiY instead of picking only the best one. 
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3.4 Computational Complexity 



The complexity of computing a greedy regularization path using the classic greedy algorithm 
in Section r3.2l is O(n^): at each step k, it computes (n — k) maximum eigenvalue of matrices 
with size k. The approximate algorithm in Section [3.31 computes a full path in O(n^): the 
first Cholesky decomposition is O(n^), while the complexity of the k-th iteration is 0{k'^) 
for the maximum eigenvalue problem and 0{v?) for computing all products {x^aj). Also, 
when the matrix S is directly given as a Gram matrix A with A G R'^^" with q < n, 
it is advantageous to use A directly as the square root of E and the total complexity of 
getting the path up to cardinality p is then reduced to 0{p^ +p'^n) (which is 0{p^) for the 
eigenvalue problems and 0{p'^n) for computing the vector products). 



4. Convex Relaxation 

In Section [21 we showed that the original sparse PCA problem ([2]) could also be written as 
in (gD: 

n 

(I^{P) = max V((af x)^ - p)+. 

1=1 

Because the variable x appears solely through X = xx'^ , we can reformulate the problem 
in terms of X only, using the fact that when ||x|| = 1, X = xx'^ is equivalent to Tr[X) = 1, 
X ^ and Rank(X) = 1. We thus rewrite (j4|) as: 

(f>{p)= max. Y.7=ii"'I^(^i- P)+ 

s.t. Tr{X) = 1, Rank(X) = 1 

xyo. 

Note that because we are maximizing a convex function over the convex set (spectahedron) 
An = {X G S„ : Tr{X) = 1, X >: 0}, the solution must be an extreme point of A„ 
(i.e. a rank one matrix), hence we can drop the rank constraint here. Unfortunately, 
X I— > (aJXai — p)+, the function we are maximizing, is convex in X and not concave, which 
means that the above problem is still hard. However, we show below that on rank one 
elements of A„, it is also equal to a concave function of X, and we use this to produce a 
semidefinite relaxation of problem Q. 

Proposition 1 Let A G R"^", p > and denote by ai, . . . ,an G R" the columns of A, an 
upper bound on: 

(j){p)= max. Y^'^^^{afXai-p)+ 

s.t. Tr{X) = 1, X ^ 0, Rank(X) = 1 ^ ^ 

can be computed by solving 

^Pip)= max. jyi^,TriXy^B,Xy^)+ 

s.t. Tr{X) = 1, X hO. ^ ' 

in the variables X G S„, where Bi = a^aj — pi, or also: 

^{p)= max. E7=l^^iP^B^) 

s.t. Tr{X) = 1, X ^ 0, Xy Pi^O, 
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which is a semidefinite pTOQTQjTTi in the vdviublcs X G S^^, G S^^. 

Proof We let X^/^ denote the positive square root (i.e. with nonnegative eigenvalues) of 
a symmetric positive semi-definite matrix X. In particular, if X = xx^ with ||x|| = 1, then 
X^/"^ = X = xx"^ , and for all /3 G R, [ixx'^ has one eigenvalue equal to /3 and n — 1 equal 
to 0, which implies Tr(/?xx'^)+ = /?+. We thus get: 

{afXai — /?)+ = Tr{{afxx'^ai — p)xx'^)^ 
= Tr{x{x'^ aittfx — 

= Tr(Xi/2a,afxi/2 - pX)+ = Tr{X'/^{aiaJ - pI)X'/^)+. 

For any symmetric matrix B, the function X i— > Tr(X^/^i?X^/^)+ is concave on the set of 
symmetric positive semidefinite matrices, because we can write it as: 

Tr(X^/^BX^/'^)+ = max Tr(PB) 
{0<P<X} 

= min Tr(YX), 

{YtB, YtO} 

where this last expression is a concave function of X as a pointwise minimum of affine 
functions. We can now relax the original problem into a convex optimization problem by 
simply dropping the rank constraint, to get: 

^P{p)^ max. ^^^^ Tr(XV2a,af _ 
s.t. Tr{X) = 1, XhO, 

which is a convex program in X G S^. Note that because Bi has at most one nonnegative 
eigenvalue, we can replace Tr{X^/'^aiafX^/'^ — pX)+ by X^^.A^^^'^ (^i^'I ^^^'^ ~ P^)+ the 
above program. Using the representation of Tr{X^^'^BX^^'^)^ detailed above, problem ([7]) 
can be written as a semidefinite program: 

^P{p)= max. Er=iTr(^*^0 

s.t. Tr{X) = 1, XtO, XtPihO, 

in the variables X G Sn, -Pi G S„, which is the desired result. ■ 



Note that we always have il^{p) > (pip) when the solution to the above semidefinite 
program has rank one, il^{p) = (pip) semidefinite relaxation ([8]) is tight. This simple 

fact allows us to derive sufficient global optimality conditions for the original sparse PC A 
problem. 

5. Optimality Conditions 

In this section, we derive necessary and sufficient conditions to test the optimality of solu- 
tions to the relaxations obtained in Sections EJ as well as sufficient condition for the tightness 
of the semidefinite relaxation in ([8]). 
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5.1 Dual problem and optimality conditions 

We first derive the dual problem to ([8]) as well as the Karush-Kuhn- Tucker (KKT) optimality 
conditions: 

Lemma 2 Let A G R"^", p > and denote by ai, . . . ,an G R" the columns of A. The 
dual of problem 

^(p)= max. Er=iTr(^^^0 

s.t. Tr{X) = 1, X hO, X hPi^O, 

in the variables X £ Sn, Pi G Sn, is given by: 

min. Amax (EHi ^i) /^^ 
s.t. YiyBi,YiyO,i = l,...,n. ^ ' 

in the variables Yi £ Sn- Furthermore, the KKT optimality conditions for this pair of 
semidefinite programs are given by: 

{X - Pi)Y, = 0, P,B, = PiYi (10) 

^ >z B,, Y,,X,Pi ^ 0, X ^ P„ TrX = 1. 
Proof Starting from: 

max. YJU^^{P^B^) 
s.t. Q^Pi^X 

Tr{X) = 1, X ^ 0, 

we can form the Lagrangian as: 

n 

LiX, P,,Yi) = Tr(P,5,0 + Tr{Y,{X - Pi)) 
1=1 

in the variables X, Pi, Yi G S„, with X, Pi, Yi h and Tr{X) = 1. Maximizing L{X, Pi,Yi) 
in the primal variables X and Pi leads to problem (|9l). The KKT cond itions for this primal- 



dual pair of SDP can be derived from Boyd and Vandenberghel (2004, p. 267). 



5.2 Optimality conditions for rank one solutions 

We now derive the KKT conditions for problem ([8]) for the particular case where we are 
given a rank one candidate solution X = xx^ and need to test its optimality. These 
necessary and sufficient conditions for the optimality of X = xx"^ for the convex relaxation 
then provide sufficient conditions for global optimality for the non-convex problem ([2]). 
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Lemma 3 Let A G R"^", p > and denote by ai, . . . ,an G R" the columns of A. The 
rank one matrix X = xx'^ is an optimal solution of ^ if and only if there are matrices 
Yi G Sn, i = 1, . . . ,n such that: 

Amax (Er=i = E.e/((«f - P) 
x^y.^ = { ^f-f ' ^ ^ (11) 

. yi h Bi, Y, h 0. 

where Bi = aiaf — pi, i = 1, . . . ,n and P is the complement of the set I defined by: 

max(af x)^ < p < min(of x)^. 

Furthermore, x must be a leading eigenvector of both "^i^j aiof and EiLi^- 

Proof We apply Lemma [2] given X = xx'^. The condition ^ Pj ^ xx"^ is equivalent to 
Pi = ttixx^ and ai G [0, 1]. The equation PiBi = XYi is then equivalent to ai{x'^BiX — 
x'^Yix) = 0, with x^BiX = (af x)^ — p and the condition {X — Pi)Yi = becomes x'^Yix{l — 
Oi) = 0. This means that x^y^x = {{ajx)^ — /))+ and the first-order condition in (fTO]l 
becomes Amax (EiLi yi) ~ (Er=i y^) ^- Finally, we recall from Section [2] that: 

n 

J2tGlii"'I^f - P) = y]^^i((af2;)2 -p) 

X =1 nG|0,l}" — 7 
1=1 

= max X^i^x{Adiag(u)A'^) — pl^u 

we{o,i}" 



hence x must also be a leading eigenvector of E/g/ ^i^- 



The previous lemma shows that given a candidate vector x, we can test the optimality 
of X = xx^ for the semidefinite program ([7]) by solving a semidefinite feasibility problem in 
the variables Yi G S^- If this (rank one) solution xx"^ is indeed optimal for the semidefinite 
relaxation, then x must also be globally optimal for the original nonconvex combinatorial 
problem in ([2]), so the above lemma provides sufficient global optimality conditions for the 
combinatorial problem ([2]) based on the (necessary and sufficient) optimality conditions for 
the convex relaxation ([7]) given in lemmaEJ In practice, we are only given a sparsity pattern 
/ (using the results of Section [3] for example) rather than the vector x, but Lemma [3] also 
shows that given /, we can get the vector x as the leading eigenvector of Eie/"*'^?"- 

The next result provides more refined conditions under which such a pair (/, x) is optimal 
for some value of the penalty p > based on a local optimality argument. In particular, 
they allow us to fully specify the dual variables Yi for i € I. 

Proposition 4 Let A G R"^", p > and denote by ai, . . . ,an G R" the columns of A. Let 
X be the largest eigenvector of^i^jUiaJ. Let L be such that: 

max(afx)^ < p < min(afx)^, (12) 
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the matrix X = xx^ is optimal for problem if and only if there are matrices 1^ € S"" 
satisfying 

A,„„ I V iilii^ + y: 1 < TM':? - P), (13) 



with Yi^Bi- ^$^0^, Yi y 0, where Bi = OioJ -pi, i = l,...,n. 

Proof We first prove the necessary condition by computing a first order expansion of the 
functions Fi : X ^ Tr{X^/^ BiX^/^)+ around X = XX . The expansion is based on the 
results in Appendix[X] which show how to compute derivatives of eigenvalues and projections 
on eigensubspaces. More precisely, Lemma [TU] states that \ix^ Bx > 0, then, for any y ^ 0: 

FMl - t)xx^ + tY) = Fi(xx^) + - J—TrBiXX^Bi(Y - xx^) + Oit^'"^), 

BiX 

while if x^Bx < 0, then, for any Y y 0,: 

F,((l - t)xx^ + tY) =t^Tr (^B,^ _ ^^^^ y + o(,3/2)_ 

Thus if X = xx'^ is a global maximum of ^^iFi^X), then this first order expansion must 
reflect the fact that it is also local maximum, i.e. for ah y G S" such that Y ^ and 
Try = 1, we must have: 

1 " 

lim - - t)xx^ + tY) - Fi{xx^)] < 0, 

1=1 

which is equivalent to: 

BjXx"'" Bj\ „ BjXX^ B, 



iG/ Vie/ / ie-f: ^ ^ ^ ^ + 

Thus if X = xx'^ is optimal, with a = X^jg/ a^^^-BjX, we get: 



max 



which is also in dual form (using the same techniques as in the proof of Proposition [T|) : 



min Amax 



which leads to the necessary condition. In order to prove sufficiency, the only non trivial 
condition to check in Lemma [3] is that x'^YiX = for i G I'^, which is a consequence of the 
inequality: 



sr^BiXx'^Bi sr^^A , , sr^Bixx^Bi sr^^A . ^f^BiXx'^Bi 
i&i iei" / Kiel iai"" / , 



11 



This concludes the proof. ■ 

The original optimality conditions in ^ are highly degenerate in Yi and this result refines 
these optimality conditions by invoking the local structure. The local optimality analysis in 
proposition m gives more specific constraints on the dual variables Yi. For i £ I,Yi must be 
equal to Bixx^ Bi/ BiX, while if i G I'^, we must have Yi^ Bi — Bixx^ Bi/x^ BiX, which 
is a stricter condition than Yi >z Bi (because x'^BiX < 0). 

5.3 Efficient Optimality Conditions 

The condition presented in Proposition H] still requires solving a large semidefinite program. 
In practice, good candidates for Yi, i G I'^ can be found by solving for minimum trace 
matrices satisfying the feasibility conditions of proposition HI As we will see below, this can 
be formulated as a semidefinite program which can be solved explicitly. 

Lemma 5 Let A G R"^'", p > 0, x £ R" and Bi = Uiof — pi with oi, . . . , a„ G R" the 

columns of A. If (ajx)"^ < p and \\x\\ = 1, an optimal solution of the semidefinite program: 

minimize Tr Yi 

subject to Y^hBi- te^, x^YiX = 0, Yi ^ 0, 



is given by: 



{p-{alxy)} \\{I-xx^)ai\\^ 



Proof Let us write Mi = Bi - ^i^^-^, we first compute: 



af Mitti = {af ai - p)ai ai - ^ ^ ^ ^ 



[aJxY - p 

{afai — p) , T / T \2\ 
p - {af xy 

When a^ai < p, the matrix Mj is negative semidefinite, because ||3;|| = 1 means ajMoi < 
and x'^Mx = ajMx = 0. The solution of the minimum trace problem is then simply Yi = 0. 
We now assume that afoi > p and first check feasibility of the candidate solution Yi in ()14p . 
By construction, we have Yi ^ and YiX = 0, and a short calculation shows that: 

a^ Yiai = p- . r n2n ("» - ("i ^) ) 

{p - {aj xY) 

= ajMiai. 

We only need to check that Yi ^ Mi on the subspace spanned by and x, for which there 
is equality. This means that Yi in (jl4p is feasible and we now check its optimality. The dual 
of the original semidefinite program can be written: 

maximize Tr Pi Mi 

subject to I — Pj + uxx"^ >z 
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and the KKT optimality conditions for this problem are written: 

Yi{I -Pi + uxx^) = 0, Pi{Y, - Mi) = 0, 
I - Pi + vxx^ >z 0, 

Pi ^0, Yih 0, Yi >z Mi, Yixx^ = 0, i G r. 

Setting Pi = YiTrYi/TrY^ and u sufficiently large makes these variables dual feasible. 
Because all contributions of x are zero, TrYi{Yi - Mi) is proportional to Tr aiaf{Yi - Mi) 
which is equal to zero and Yi in (I14p satisifies the KKT optimality conditions. ■ 



We summarize the results of this section in the theorem below, which provides sufficient 
optimality conditions on a sparsity pattern /. 

Theorem 6 Let A £ R"^", p > 0, E = A^A with oi,...,an G R" the columns of A. 
Given a sparsity pattern I, setting x to be the largest eigenvector of^^^i OioJ , if there is a 
/?* > such that the following conditions hold: 



T \2 

ai X) 



max(afx)^ < p* < min(afx)^ and Xmax / < 7 (( 

\i=l / i&I 

with the dual variables Yi for i £ P defined as in and: 

BiXx^Bi 

= — t^t; 1 when i E I, 

x^ BiX 

then the sparsity pattern I is globally optimal for the sparse PCA problem (0j with p = p* 
and we can form an optimal solution z by solving the maximum eigenvalue problem: 

rp 

z = argmax z Sz. 

{^jc=0, \\z\\=l} 

Proof Following proposition H] and lemma [5l the matrices Yi are dual optimal solutions 
corresponding to the primal optimal solution X = xx^ in d?]). Because the primal solution 
has rank one, the semidefinite relaxation ([8]) is tight so the pattern / is optimal for ([2j) and 
Section [2] shows that z is a globally optimal solution to ^ with p = p*. ■ 



5.4 Gap minimization: finding the optimal p 

All we need now is an efficient algorithm to find p* in theorem [6l As we will show below, 
when the dual variables Yf are defined as in ()14p , the duality gap in ([2]) is a convex function 
of p hence, given a sparsity pattern I, we can efficiently search for the best possible p (which 
must belong to an interval) by performing a few binary search iterations. 

Lemma 7 Let A £ R"^", p >0, T, = A'^A with ai, . . . , a„ G R" the columns of A. Given 
a sparsity pattern L, setting x to be the largest eigenvector of "^i^j aiaj , with the dual 
variables Yi for i £ I'^ defined as in ^14\ ) and: 

BiXx^Bi 
= — t^t; — ' when i £ 1. 
x^ BiX 
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The duality gap in ^ which is given by: 

gap(p) = Amax y}^ - ^((af x)2 - p), 
is a convex function of p when 



maxia,- x) < p < mmlo,- x) . 



Proof For i G / and u G R", we have 



rp _ {u^aiofx - pu^xY 



u YiU 



which is a convex function of p (jBovd and Vandenberghd . |200J, p. 73). For i G J'^, we can 
write: 

p-[aixy V p-\0'iXrJ 

hence max{0, p(a^Oj — p)/{p — (a^x)^)} is also a convex function of p. This means that: 

{ajai - p) \ {u^Oi - (x^n)(x^ai))^ 



u YiU = max < 0, p 



(p-(afx)2)j ||(I-xrE^)a,||2 



IS 



convex in p when i G P. We conclude that Yll=i u Y^u is convex, hence: 



u YiU - }({ai x) - p) 

ir"ii— J- . , — Z 



i=l iel 

is also convex in p as a pointwise maximum of convex functions of p. 



This result shows that the set of p for which the pattern I is optimal must be an interval. 
It also suggests an efficient procedure for testing the optimality of a given pattern /. We 
first compute x as a leading eigenvector "^i^j ciiof . We then compute an interval in p for 
which x satisfies the basic consistency condition: 

max(aj x) = pmm < P < Pmax = min(aj x) . 

Note that this interval could be empty, in which case / cannot be optimal. We then 
minimize gap(/o) over the interval [/Omim Pmax]- If the minimum is zero for some p = p*, then 
the pattern / is optimal for the sparse PC A problem in ^ with p = p*. 

Minimizing the convex function gap{p) can be done very efficiently using binary search. 
The initial cost of forming the matrix Y17=i which is a simple outer matrix product, is 
O(n^). At each iteration of the binary search, a subgradient of ga,p{p) can then be computed 
by solving a maximum eigenvalue problem, at a cost ofO{n^). This means that the complex- 
ity of finding the optimal p over a given interval [pmin, Pmax] is 0{n'^ log2((/0max - Pmin)/e)), 



14 



where e is the target precision. Overall then, the total cost of testing the optimality of a 
pattern I is 0(n^ + ■n? log2((pmax - /Omm)/e))- 

Note that an additional benefit of deriving explicit dual feasible points Yi is that plugging 
these solutions into the objective of problem Q: 

min. Amax {Ya=i ^i) 

s.t. ^ Si, ^ 0, i = 1, . . . ,n. 

produces an upper hound on the optimum value of the original sparse PCA problem ^ even 
when the pattern I is not optimal (all we need is a p satisfying the consistency condition). 



5.5 Solution improvements and randomization 

When these conditions are not satisfied, the relaxation ([8]) has an optimal solution with 
rank strictly larger than one , hence is not tight . At su ch a point, we can use a different 
relaxation such as DSPCA bv ld'Aspremont et al.l (|2007bl l to try to get a better solution. We 
can also apply randomization techn iques to improve the quality of the solution of problem 
(P (|Ben-Tal and Nemirovskil . l2002l ^. 



6. Applications 

In this section, we discuss some applications of sparse PCA to subset selection and com- 
pressed sensing. 



6.1 Subset selection 

We consider p data points in R", in a data matrix X G R^^". We assume that we are given 
real numbers y € R^ to predict from X using linear regression, estimated by least squares. 
We are thus looking for w G R" such that ||y — is minimum. In the subset selection 

problem, we are looking for sparse coefficients i.e., a vector w with many zeros. We thus 
consider the problem: 

s{k) = min \\y — Xw\^. (15) 

wgR , Card«)<fc 

Using the sparsity pattern u G {0, 1}", and optimizing with respect to we have 

s{p)= min \\yf -y^X(:u){X{ufX{u)r^X{:ufy, (16) 
Me{o,i}", iTu<k 

where X{u) = Xdiag(ii). We can rewrite y"'" X{u){X{u)'^ X{u))^^ X{u)'^ y as the largest 
generalized eigenvalue of the pair {X{uY'yy'^X{u),X{uyX{u)), i.e., as 

,/Xiu)mufXi,a))-^X(ury = max ""^^f ff'yf ■ 



We thus have: 



II ||2 di&^{u)X^yy^Xdi&g,{u)w 
= \\y\\ - max max tt^. , s^rv TTx • ('-') 



15 



Given a pattern u £ {0, 1}", let 

so = y^X{u){X{ufX{u)r^X{ufy 

be the largest generalized eigenvalue corresponding to the pattern u. The pattern is optimal 
if and only if the largest generalized eigenvalue of the pair {X (v)"^ yy'^ X (v) , X (v)'^ X (v)} is 
less than sq for any v G {0, 1}" such that = u^l. This is equivalent to the optimality 
of u for the sparse PCA problem with matrix X^yy'^X — sqX'^X, which can be checked 
using the sparse PCA optimality conditions derived in the previous sections. 

Note that unlike in the sparse PCA case, this convex relaxation does not immediately 
give a simple bound on the optimal value of the subset selection problem. However, we get 
a bound of the following form: when v S {0, 1}" and w G R" is such that I'^v = k with: 



w 



{Xivfyy'^Xiv) - soX{vfX{v)) w<B, 



where B >0 (because sq is defined from n), we have: 



\y\\~ - So > s{k) > llyll'' - So - -B ( min Amin(^(i') X{v)) 

\ve{o,i}",iTv=k 



> \\y\ 



so-B{X^i^{X^X)) \ 



This bound gives a sufficient condition for optimality in subset selection, for any problem 
instance and any given subset. This is to be coi itrasted with the suff i cient conditions deri ved 
for particular algorithms, such as the LASSO (Yuan and Lin . 2007 . Zhao and Yu . 20061 ) or 
backward greedy selection ( Couvreur and Bresler . 200d ). Note that some of these opti i nality 
conditions are often based on sparse eigenvalue problems (see iMeinshausen and Yul ( 20061 . 
§2)), hence our convex relaxations helps both in checking sufficient conditions for optimality 
(before the algorithm is run) and in testing a posteriori the optimality of a particular 
solution. 



6.2 Sparse recovery 

Following Candes and Taol ( 2005 ) (see also Donoho and Tanner (j2005l )). we seek to recover 
a signal / G R" from corrupted measurements y = Af + e, where A S j^'"^"- ig g, coding 
matrix and e G R™ is an unknown vector of errors with low cardinality. This can be 
reformulated as the problem of finding the sparsest solution to an underdetermined linear 
system: 

minimize ||3;||o 
subject to Fx = Fy 

where ||x||o = Card(2;) and F G RP^™ is a matrix such that FA = 0. A classic trick to get 
good approximate solutions to problem (llSp is to substitute the (convex) £i norm to the 
(combinatorial) £o objective above, and solve instead: 



minimize 

subject to Fx = Fy, 



(19) 



which is equivalent to a linear program in x G R™. Following ICandes and Tad (120051 ). given 
a matrix F G RP^™ and an integer S such that < S < m, we define its restricted isometry 
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constant 5s as the smallest number such that for any subset / C [l,m\ of cardinality at 
most S we have: 

{l-6s)\\cf<\\Fjcf<{l + 5s)\\c\\\ (20) 

for all c G R'^', where Fj is the submatrix of F formed by keeping only the columns of F 
in the set /. The following result then holds. 

Proposition 8 (Candes and Tao (200 A )). Suppose that the restricted isometry con- 
stants of a matrix F G j^px*" satisfy 

Ss + S2S + ^35 < 1 (21) 

for some integer S such that < S < m, then if x is an optimal solution of the convex 
program: 

minimize \\x\\i 
subject to Fx = Fy 

such that Card x < S then x is also an optimal solution of the combinatorial problem: 

minimize \\x\\q 
subject to Fx = Fy. 

In other words, if condition (|2ip holds for some matrix F such that FA = 0, then perfect 
recovery of the signal / given y = Af + e provided the error vector satisfies Card(e) < S. 
Our key observation here is that the restricted isometry constant 6s in condition (j2ip can 
be computed by solving the following sparse maximum eigenvalue problem: 

(1 + 6s) < max. x^(F^F)x 

s. t. Card(x) < S 

\\x\\ = 1, 

in the variable x G R'" and another sparse maximum eigenvalue problem on al - FF^ with 
a sufficiently large, with 6s computed from the tightest one. In fact, ([20|) means that: 

(1 + 6s) < max max cF FT Fic 

{/C[l,m]: \I\<S} \\c\\ = l 

= max max diag(u)-F'^-F diag(n)x 

{ne{0,l}": I'^u-CS] \\x\\=l 

rp rp 

= max X F Fx, 

{||a;||=l, Card(x)<5} 

hence we can compute an upper bound on 6s by duality, with: 

(1 + ^5) < mim+ps 

where (pip) is defined in This means that while Candes and Taol ( 2005 ) obtained an 
asymptotic proof that some random matrices satisfied the restricted isometry condition 
(I2ip with overwhelming probability (i.e. exponentially small probability of failure), when- 
ever they are satisfied, the tractable optimality conditions and upper bounds we obtain in 
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Section [5] for sparse PCA problems allow us to prove, deterministically, th at a finite dimen- 
sional matrix satisfies the restricted isometry condition in OTI) . Note that ICandes and Tad 



arovide a slightly weaker condition than (|21|) based on restricted orthogonality con- 



ditions and extending the results on sparse PCA to these conditions would increase the 
maximum S for which perfect rec overy holds. In pra c tice ho wever, we will see in Section 



17.31 that the relaxations in ([9|) and Id'Aspremont et al.l (j2007bl ) do provide very tight upper 



bounds on sparse eigenvalues of random matrices but solving these semidefinite programs 
for very large scale instances remains a significant challenge. 

7. Numerical Results 

In this section, we first compare the various methods detailed here on artificial examples, 
then test their performance on a biological data set. PathSPCA, a MATLAB code repro- 
ducing these results may be downloaded from the authors' web pages. 

7.1 Artificial Data 

We generate a matrix U of size 150 with uniformly distributed coefficients in [0, 1]. We let 
V G R^^'^ be a sparse vector with: 

if i < 50 
- 50) if 50 < i < 100 
otherwise 

We form a test matrix S = U^'^U + avv^ , where a is the signal-to-noise ratio. We first 
compare the relative performance of the algorithms in Section [3] at identifying the correct 
sparsity pattern in v given the matrix S. The resulting ROC curves are plotted in figure 
[1] for a = 2. On this example, the computing time for the approximate greedy algorithm 
in Section 13.31 was 3 seconds versus 37 seconds for the full greedy solution in Section 13.21 
Both algorithms produce almost identical answers. We can also see that both sorting and 
thresholding ROC curves are dominated by the greedy algorithms. 

We then plot the variance versus cardinality tradeoff curves for various values of the 
signal-to-noise ratio. In figure [2l We notice that the magnitude of the error (duality gap) 
decreases with the signal-to-noise ratio. Also, because of the structure of our problem, there 
is a kink in the variance at the (exact) cardinality 50 in each of these curves. Note that for 
each of these examples, the error (duality gap) is minimal prec i sely at the kink. 




;n oi tnese examples, tne error (^duality gap ) is minimal prec i sely at tne Rmk. 
Next, we use the DSPCA algorithm of Id'Aspremont et al. ( 2007bl ) to fin d better solu- 



tions w here the greedy codes have failed to obtain globally optimal solutions. In ld'Aspremont et al 
(HqotS), it was shown that an upper bound on ([2]) can be computed as: 



(t){p) < min Amax(S-FC/). 

I A<n 



\U^J\<P 

which is a convex problem in the matrix [/ G S„. Note however that the cost of solving 
this relaxation for a single p i s Ofn^Vlogn) versus O(n^ ) for a full path of approximate 
solutions. Also, the results in d'Aspremont et al. ( 2007bl ) do not provide any hint on the 



value of p, but we can use the breakpoints coming from suboptimal points in the greedy 
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False Positive Rate 



Figure 1: ROC curves for sorting, thresholding, fully greedy solutions (Section I3.2p and 
approximate greedy solutions (Section l3.3p for a = 2. 



search algorithms in Section 13.31 and the consistency intervals in Eq. (jl2p . In figure [2] we 
plot the variance versus cardinality tradeoff curve for a = 10. We plot greedy variances 
(solid line), dual upper bounds from Section [5^ (dotted line) and upper bounds computed 
using DSPCA (dashed line). 



7.2 Subset selection 

We now present simulation experiments on synthetic datasets for the subset selection prob- 
lem. We consider data sets generated from a sparse linear regression problem and study 
optimality for the subset selection problem, given the exact cardinality of the generating 
vector. In this setti ng, it is known th at regularization by the .^i-norm, a procedure also 
known as the Lasso ( Tibshirani . 19961 ). will asymptoti cally lead to the cor r ect solution if 



and o nly if a certain consistency condition is satisfied (jYuan and Linl . 120071 . 1 Zhao and Yu 



2006l l. Our results provide here a tractable test the optimality of solutions obtained from 



various algorithms such as the Lasso, forward greedy or backward greedy algorithms. 

In Figure [31 we consider two pairs of randomly generated examples in dimension 16, one 
for which the lasso is provably consistent, one for which it isn't. We perform 50 simulations 
with 1000 samples and varying noise and compute the average frequency of optimal subset 
selection for Lasso and greedy backward algorithm together with the frequency of provable 
optimality (i.e., where our method did ensure a posteriori that the point was optimal). 
We can see that the backward greedy algorithm exhibits good performance (even in the 
Lasso-inconsistent case) and that our sufficient optimality condition is satisfied as long 
as there is not too much noise. In Figure HI we plot the average mean squared error 
versus cardinality, over 100 replications, using forward (dotted line) and backward (circles) 
selection, the Lasso (large dots) and exhaustive search (solid line). The plot on the left 
shows the results when the Lasso consistency condition is satisfied, while the plot on the 



19 




Cardinality Cardinality 



Figure 2: Left: variance versus cardinality tradeoff curves for o" = 10 (bottom), a = 50 
and a = 100 (top). We plot the variance (solid line) and the dual upper bounds 
from Section 15.31 (dotted line) for each target cardinality. Right: variance versus 
cardinality tradeoff curve for a = 10. We plot greedy variances (solid line), dual 
upper bounds from Section 15.31 (dotted line) and upper bounds computed using 
DSPCA (dashed line). Optimal points (for which the relative duality gap is less 
than lO"'^) are in bold. 



right shows the mean squared errors when the consistency condition is not satisfied. The two 
sets of figures do show that the LASSO is consistent only when the consistency condition is 
satisfied , while the backward g r eedy algorithm finds the correct pattern if the noise is small 
enough ( Couvreur and Bresler . 200d ) even in the LASSO inconsistent case. 



7.3 Sparse recovery 

Following the results of Section 16.21 we compute the upper and lower bounds on sparse 
eigenvalues produced using various algorithms. We study the following problem: 

maximize rr^Sx 
subject to Card(a;) < S 

\\x\\ = 1, 

where we pick F to be normally distributed and small enough so that computing sparse 
eigenvalues by exhaustive search is numerically feasible. We plot the maximum sparse 
eigenvalue versus cardinality, obtained using exhaustive search (solid line) , the approximate 
greedy (dotted line) and fully greedy (dashed line) algorithms. We also plot the upper 
bounds obtained by minimizing the gap of a rank one solution (squares), by solving the 
semidefinite relaxation explicitly (stars) and by solving the DSPCA dual (diamonds). On 
the left, we use a matrix S = F^F with F Gaussian. On the right, S = uti"^/||M|p + 2V, 
where Ui = i = 1, . . . ,n and V is matrix with coefficients uniformly distributed in [0, 1]. 
Almost all algorithms are provably optimal in the noisy rank one case (as well as in many 
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Figure 3: Backward greedy algorithm and Lasso. We plot the probability of achieved (dot- 
ted line) and provable (solid line) optimality versus noise for greedy selection 
against Lasso (large dots), for the subset selection problem on a noisy sparse 
vector. Left: Lasso consistency condition satisfied. Right: consistency condition 
not satisfied. 




Figure 4: Greedy algorithm and Lasso. We plot the average mean squared error versus car- 
dinality, over 100 replications, using forward (dotted line) and backward (circles) 
selection, the Lasso (large dots) and exhaustive search (solid line). Left: Lasso 
consistency condition satisfied. Right: consistency condition not satisfied. 
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Cardinality Cardinality 

Figure 5: Upper and lower bound on sparse maximum eigenvalues. We plot the maximum 
sparse eigenvalue versus cardinality, obtained using exhaustive search (solid line) , 
the approximate greedy (dotted line) and fully greedy (dashed line) algorithms. 
We also plot the upper bounds obtained by minimizing the gap of a rank one 
solution (squares), by solving the semidefinite relaxation explicitly (stars) and by 
solving the DSPCA dual (diamonds). Left: On a matrix F^'^F with F Gaussian. 
Right: On a sparse rank one plus noise matrix. 



of the biological examples that follow), while Gaussian random matrices are harder. Note 
however, that the duality gap between the semidefinite relaxations and the optimal solution 
is very small in both cases, while our bounds ba sed on greedy solutions ar e not as good. 
This means that solving the relaxations in ([9]) and d'Aspremont et al. ( 2007bl ) could provide 
very tight upper bounds on sparse eigenvalues of random matrices. However, solving these 
semidefinite programs for very large values of n remains a significant challenge. 



7.4 Biological Data 



We r un the algorithm of Section 13.31 on two gene expre ssion data sets, one on Colon cancer 
from lAlon et al.l (|l999l l , the other on Lymphoma from lAlizadeh et al.l (|200d ) . We plot the 
variance versus cardinality tradeoff curve in figure [U together with the dual upper bounds 
from Section [5.31 In both cases, we consider the 500 genes with largest variance. Note that 
for many cardinalities, we have optimal or very close to optimal solutions. In Table [U we 
also compare the 20 most important genes selected by the second sparse PCA fact or on the 



colon cancer data set, with the top 10 genes selected by the RankGene software by lSu et al 
( 20031 ). We observe that 6 genes (out of an original 4027 genes) were both in the top 20 
sparse PCA genes and in the top 10 Rankgene genes. 



8. Conclusion 

We have presented a new convex relaxation of sparse principal component analysis, and 
derived tractable sufficient conditions for optimality. These conditions go together with 
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25.1 


H43887 


Complement fact. D prec. 


10 


2.1 


M63391 


Human desmin 


12 


32.3 


T47377 


S-IOOP Prot. 



Table 1: 6 genes (out of 4027) that were both in the top 20 sparse PC A genes and in the 
top 10 Rankgene genes. 




Cardinality 



Figure 6: Variance (solid lines) versus cardinality tradeoff curve for two gene expression 
data sets, lymphoma (top) and colon cancer (bottom), together with dual upper 
bounds from Section 15.31 (dotted lines). Optimal points (for which the relative 
duality gap is less than 10~^) are in bold. 



efficient greedy algorithms that provide candidate solutions, many of which turn out to be 
optimal in practice. The resulting upper bounds also have direct applications to problems 
such as sparse recovery, subset selection or LASSO variable selection. Note that we exten- 
sively use this convex relaxation to test optimality and provide bounds on sparse extremal 
eigenvalues, but we almost never attempt to solve it numerically (except in some of the 
numerical experiments), which would provide optimal bounds. Having n matrix variables 
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of dimension n, the problem is of course extremely large and finding numerical algorithms 
to directly optimize these relaxation bounds would be an important extension of this work. 



Appendix A. Expansion of eigenvalues 

In this appendix, we consider various results on expansions of eigenvalues we use in order 
to derive sufficient conditions. The following proposition derives a second order expansion 
of the set of eigenvectors corresponding to a single eigenvalue. 

Proposition 9 Let N & S". Let Xq be an eigenvalue of N, with multiplicity r and eigen- 
vectors U G R"^'" (such that U^U = 1). Let A be a matrix in S". // \\A.\\f is small enough, 
the matrix + A has exactly r (possibly equal) eigenvalues around Xq and if we denote by 
{N + A)ao the projection of the matrix + A onto that eigensub space, we have: 



{N + A)ao = XqUU^ + UU^AUU^ + XoUU^A{XoI - N)^ + Ao(AoI - N)^AUU^ 

+UU^AUU^ A{XoI - N)^ + (Aol - N)^AUU^AUU^ + UU^ A{XoI - N)^UU^ 
+AoC/C/^A(AoI - iV)tA(AoI - iV)t + Ao(AoI - iV)tA(AoI - N)^AUU^ 
+Ao(AoI - M)^AUU^A{XoI - M)^ + 0(||A|||) 

which implies the following expansion for the sum of the r eigenvalues in the neigborhood 
of Xq: 

Tr(iV + A)Ao = rAo + Tr[/^A[/ + TrC/^A(AoI- iV)^A[/ 

+Ao Tr(AoI - N)^AUU^A{XoI - N)^ + 0(|| A|||). 

Proo f We use the Cauchy residue formulation of projections on principal subspaces ( Katol . 
19661 ): given a symmetric matrix A^, and a simple closed curve C in the complex plane that 



does not go through any of the eigenvalues of A^, then 

1 f dX 



AI- A^ 



is equal to the orthogonal projection onto the orthogonal sum of all eigensubspaces of A^ 
associated with eigenvalues in the interior of C (|Katol . ll966l ). This is easily seen by writing 
down the eigenvalue decomposition A^ = X^iLi ^i'^i'^J ^ ^^'^ the Cauchy residue formula 
{■^ §c A^T" = 1 if is in the interior int(C) of C and otherwise), and: 



1 f dX 1 f dX 



ET A \ — ^ 



=1 ^ i, A,emt(C) 

bee iRudinI (|l987l ') for an introduction to complex analysis and Cauchy residue formula. 
Moreover, we can obtain the restriction of A^ onto a specific sum of eigensubspaces as: 

1 f NdX 1 f XdX 



1 / NdX 1 / 

A^IIcfA^) = = 

^ ^ 2i7r JcXl-N 2i-K Jc 



XI- N' 
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From there we can easily compute expansions around a given N by using expansions of 
(AI — N)~^. The proposition fohows by considering a circle around Aq that is small enough 
to exclude other eigenvalues of N, and applying several times the Cauchy residue formula. ■ 



We can now apply the previous proposition to our particular case: 

Lemma 10 For any a G R", p > and B = aa'^ — pi, we consider the function F : X ^ 
Tr(XV25xV2)_^ from to R. let x G R" such that \\x\\ = 1. Let y ^ 0. Ifx^Bx > 0, 
then 

F((l - t)xx^ + tY) = x^Bx + -^T^T^Bxx^BiY - xx^) + Oit^/"^), 

x^ Bx 

while if x^'^Bx < 0, then 

F{{1 - t)xx^ + tY) = Tr (^y V2 (^q - Y'I^^ + 0{t^'^). 

Proof We consider X{t) = (1 - t)xx^ + tY . We have X{t) = U{t)U{tf with U{t) = 

^i/2yi/2 )' which implies that the non zero eigenvalues of X{ty^^BX{ty^'^ are the 
same as the non zero eigenvalues of U{t)'^ BU{t). We thus have 

F{X{t)) = Tr(M(t))+, 



with 
M{t) 



(1 - t)x'^Bx - t)V2a.T^yl/2 

il/2(l_t)l/2yT^^ tyi/25yi/2 

x^Bx 0\ ( x^Byi/2\ f-x'^Bx 
0j+* [y^/^Bx J+ V yi/25yi/2 

= M{0) +t^/^Ai+tA2 + 0{t^/^). 

The matrix M{0) has a single (and simple) non zero eigenvalue which is equal to Aq = x'^Bx 
with eigenvector U = (1,0)"^. The only other eigenvalue of M(0) is zero, with multiplicity 
n. Proposition [9] can be applied to the two eigenvalues of M(0): there is one eigenvalue of 
M{t) around x'^Bx, while the n remaining ones are around zero. The eigenvalue close to 
Aq is equal to: 

Tr(M(t))Ao = iTr[/^A2t/ + Ao + tTrC/^Ai(AoI-M(0))^Ai[/ 

+AoTr(AoI - M{0))^AiUU^Ai{\oI - M{0))^ + 0{t^/^) 

= x'^Bx + ^—TrBxx'^B{Y-xx^)+0{t^/^). 
x^ Bx 

For the remaining eigenvalues, we get that the projected matrix on the eigensubspace 
of M{t) associated with eigenvalues around zero is equal to 

(M(t))o = t{l-UU'^)A2{l-UU^) + t{l-UU^)Ai{-M{{)))\l-UU^) + 0{t^l'^) 

tY^/^{B - m^)Y^/^ 
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which leads to a positive part equal to t+ Tr \y'^^^{B - ^^g^)y V2 



. When Bx > 0, 



then the matrix is negative definite (because B = aa — pi), and thus the positive part is 
zero. By summing the two contributions, we obtain the desired result. ■ 
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